Bilinear Attention Networks

Neural Information Processing Systems

Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost of learning attention distributions for every pair of multimodal input channels is prohibitively expensive. To address this problem, co-attention builds two separate attention distributions, one for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.
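The core mechanism the abstract describes can be sketched in a few lines of NumPy: low-rank projections of each modality, a bilinear attention map over all channel pairs, and a pooled joint representation. This is a minimal illustrative sketch, not the paper's implementation; the function name, the single-head/single-glimpse setup, and the shapes of `U`, `V`, and `p` are assumptions for exposition.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def bilinear_attention(X, Y, U, V, p):
    """Illustrative low-rank bilinear attention over all (i, j) channel pairs.

    X: (N, dx) first-modality features (e.g. question words)
    Y: (M, dy) second-modality features (e.g. image regions)
    U: (dx, r), V: (dy, r) low-rank projection matrices; p: (r,) pooling vector.
    Returns the (N, M) attention map and the (r,) joint representation.
    """
    Xu = X @ U  # (N, r) projected first modality
    Yv = Y @ V  # (M, r) projected second modality
    # Bilinear attention logit for pair (i, j): p^T (U^T x_i * V^T y_j)
    logits = np.einsum('ir,jr,r->ij', Xu, Yv, p)        # (N, M)
    A = softmax(logits.ravel()).reshape(logits.shape)   # softmax over all pairs
    # Joint representation: low-rank bilinear pooling weighted by the attention map
    joint = np.einsum('ij,ir,jr->r', A, Xu, Yv)         # (r,)
    return A, joint
```

A single attention map costs O(N·M·r) here; the real model stacks several such "glimpses" and feeds the pooled vectors through residual connections.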


Fast Transformers with Clustered Attention: Supplementary Material

Neural Information Processing Systems

We first cluster the queries Q using K-means clustering to output S, which indicates the membership of queries to different clusters. The lower half of the figure shows the new value V̂_t computed by sparse dot-products with the keys K and values V corresponding to the top-k keys in T. Figure 6: We show training/validation loss convergence for different transformer variants. Both clustered variants have significantly better convergence than both lsh-1 and lsh-4. Note that due to a smaller batch size, full makes many more updates than all other transformer variants. In Figure 6a, we show the training loss convergence for different transformer variants.


Fast Transformers with Clustered Attention

Neural Information Processing Systems

Clustered attention makes use of similarities between queries and groups them in order to reduce the computational cost. In particular, we perform fast clustering using locality-sensitive hashing and K-Means and only compute the attention once per cluster.
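The idea described above can be sketched with plain K-means on the queries followed by one attention row per centroid, broadcast back to the cluster members. This is a simplified sketch, not the paper's implementation: it omits the locality-sensitive-hashing step and the top-k refinement, and the function name and cluster count are assumptions.

```python
import numpy as np

def clustered_attention(Q, K, V, n_clusters=4, n_iters=10, seed=0):
    """Approximate softmax attention by clustering the queries.

    Queries are grouped with a few K-means iterations; attention is then
    computed once per cluster centroid and the result is broadcast to every
    query in that cluster, reducing N attention rows to n_clusters rows.
    """
    rng = np.random.default_rng(seed)
    # --- plain K-means on the queries (fancy indexing copies, so Q is untouched) ---
    centroids = Q[rng.choice(len(Q), n_clusters, replace=False)]
    for _ in range(n_iters):
        dist = ((Q[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, C)
        assign = dist.argmin(1)                                        # (N,)
        for c in range(n_clusters):
            members = Q[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    # --- one scaled-dot-product attention computation per centroid ---
    logits = centroids @ K.T / np.sqrt(K.shape[1])   # (C, M)
    w = np.exp(logits - logits.max(1, keepdims=True))
    w /= w.sum(1, keepdims=True)
    out_per_cluster = w @ V                          # (C, dv)
    return out_per_cluster[assign]                   # (N, dv), broadcast back
```

The cost of the attention step drops from O(N·M) to O(C·M) dot-product rows, at the price of all queries in a cluster receiving the same output.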



A Data Collection and Details about the

Neural Information Processing Systems

We collected about 30 million text-image pairs from multiple channels and built a new 2.5TB dataset (after tokenization, the size becomes about 250GB). The sources of data fall into the following categories: (1) professional image websites (both English and Chinese), where the images usually come with captions. We have already introduced the tokenizers in Section 2.2; here are some details. Colored grids mark all the tokens attended to by the token marked "O".





Generalizable Multi-Linear Attention Network

Neural Information Processing Systems

The majority of existing multimodal sequential learning methods focus on how to obtain powerful individual representations and neglect to effectively capture the multimodal joint representation. The bilinear attention network (BAN) is a commonly used integration method, which leverages tensor operations to associate the features of different modalities.
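The tensor-operation fusion mentioned above extends naturally from two modalities to many: project each modality into a shared rank-r space, combine them with an elementwise (Hadamard) product, and pool. The sketch below is a hedged illustration of that general multi-linear pooling pattern, assuming per-modality factor matrices; the function name and shapes are not from the paper.

```python
import numpy as np

def multilinear_pool(feats, factors, p):
    """Low-rank multi-linear pooling across an arbitrary number of modalities.

    feats:   list of per-modality feature vectors, feats[m] with shape (d_m,)
    factors: list of factor matrices, factors[m] with shape (d_m, r)
    p:       (r,) pooling vector.
    Returns a scalar fused score; with two modalities this reduces to
    low-rank bilinear pooling.
    """
    h = np.ones_like(p)
    for x, W in zip(feats, factors):
        h = h * (x @ W)  # accumulate the Hadamard product in rank-r space
    return float(h @ p)
```

Replacing the full interaction tensor with per-modality rank-r factors keeps the parameter count linear in the number of modalities rather than exponential.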